library(gtsummary)
trial
Germans Trias i Pujol Research Institute and Hospital (IGTP)
Badalona, Spain
March 26, 2025
1. Introduction
2. Descriptive tables
3. gtsummary
4. Descriptive graphics
5. ggplot2
6. Comparison between groups
6.1 Independent samples
6.2 Paired samples
6.3 Effect size
Techniques used to summarize categorical and numerical data of a sample or population in a meaningful way, both numerically and graphically.
It is an essential first step when conducting research, before any inference.
Depending on the scope of a project, descriptive statistics alone may be sufficient, or they can be used to prepare further statistical analyses.
Descriptive statistics can be used to describe a single variable (univariate) or the relationship between more than one variable (bivariate/multivariate).
The variable type (numerical/categorical) determines which types of descriptive statistics we should use.
| trt | age | marker | stage | grade | response | death | ttdeath |
|---|---|---|---|---|---|---|---|
| Drug A | 23 | 0.160 | T1 | II | 0 | 0 | 24.00 |
| Drug B | 9 | 1.107 | T2 | I | 1 | 0 | 24.00 |
| Drug A | 31 | 0.277 | T1 | II | 0 | 0 | 24.00 |
| Drug A | NA | 2.067 | T3 | III | 1 | 1 | 17.64 |
| Drug A | 51 | 2.767 | T4 | III | 1 | 1 | 16.43 |
| | N = 200 |
|---|---|
| Chemotherapy Treatment, n (%) | |
| Drug A | 98 (49%) |
| Drug B | 102 (51%) |
| T Stage, n (%) | |
| T1 | 53 (27%) |
| T2 | 54 (27%) |
| T3 | 43 (22%) |
| T4 | 50 (25%) |
| Grade, n (%) | |
| I | 68 (34%) |
| II | 68 (34%) |
| III | 64 (32%) |
| | Drug A, N = 98 | Drug B, N = 102 |
|---|---|---|
| T Stage, n (%) | | |
| T1 | 28 (29%) | 25 (25%) |
| T2 | 25 (26%) | 29 (28%) |
| T3 | 22 (22%) | 21 (21%) |
| T4 | 23 (23%) | 27 (26%) |
| Grade, n (%) | | |
| I | 35 (36%) | 33 (32%) |
| II | 32 (33%) | 36 (35%) |
| III | 31 (32%) | 33 (32%) |
Numerical variables (continuous or discrete) are described by a central tendency statistic to know which value is the most “typical”, together with a measure of how spread out observations are around this value (variability).
The three most common measures of central tendency are:
→ Mean: average of all observations.
\[ \bar{x} = \frac{\sum x_i}{n} \]
→ Median: middle observation when sorted in order from least to greatest.
→ Mode: value that appears most often.
[1] 23 9 31 NA 51 39 37 32 31 34 42 63 54 21 48 71 38 49 57 46 47 52 61 38 34
[26] 49 63 67 68 78 36 37 53 36 51 48 57 31 37 28 40 49 61 56 54 71 38 31 48 NA
[51] 83 52 32 53 69 60 45 39 NA 38 36 71 31 43 57 53 25 44 25 30 51 40 NA 43 21
[76] 54 67 43 54 41 34 34 6 39 36 58 27 47 NA 50 61 47 52 51 68 33 65 34 38 60
[101] 10 49 56 50 60 49 54 39 48 65 47 61 34 NA NA 58 26 44 17 68 57 66 44 NA 67
[126] 48 62 35 53 53 66 55 57 47 58 43 45 44 63 59 44 53 51 28 65 63 76 61 33 48
[151] 42 36 55 20 26 50 47 74 50 31 45 51 66 76 47 48 56 70 46 43 41 41 19 49 43
[176] 43 75 52 42 37 45 35 67 38 44 45 39 46 NA 42 60 31 45 38 NA 19 69 66 NA 64
attr(,"label")
[1] "Age"
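As a minimal sketch, the three measures can be computed in base R for the first ten ages printed above. Base R has no built-in function for the statistical mode, so stat_mode() below is a hypothetical helper written only for this illustration; na.rm = TRUE drops the missing value before computing.

```r
# First ten values of trial$age, copied from the output above (one is missing)
age <- c(23, 9, 31, NA, 51, 39, 37, 32, 31, 34)

# Mean and median, ignoring the missing value
age_mean   <- mean(age, na.rm = TRUE)
age_median <- median(age, na.rm = TRUE)

# Hypothetical helper: most frequent value (the smallest one if there are ties)
stat_mode <- function(x) {
  counts <- table(x)  # table() drops NA by default
  as.numeric(names(counts)[which.max(counts)])
}
age_mode <- stat_mode(age)

age_mean    # 31.88889
age_median  # 32
age_mode    # 31
```

With the full variable, age would simply be replaced by trial$age.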
The most common measures of variability are:
→ Standard deviation (SD): typical distance of the observations from the mean.
\[ \text{SD} = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n - 1}} \]
→ Range: difference between the maximum and minimum values.
→ Interquartile range (IQR): difference between the 75th percentile (Q3) and the 25th percentile (Q1).
The p-th percentile is the value at or below which p% of the observations fall. The 25th, 50th and 75th percentiles are called quartiles and divide the sorted values into four equal parts. The second quartile (Q2) is the median.
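The measures of variability can be sketched in base R in the same way, again using the first ten ages printed earlier. Note that quantile() supports several percentile definitions; the default one is assumed here, and other software may give slightly different quartiles.

```r
# First ten values of trial$age, copied from the output shown earlier
age <- c(23, 9, 31, NA, 51, 39, 37, 32, 31, 34)

# Standard deviation (n - 1 denominator, as in the formula above)
age_sd <- sd(age, na.rm = TRUE)

# Range: difference between the maximum and the minimum
age_range <- diff(range(age, na.rm = TRUE))

# Quartiles: 25th, 50th and 75th percentiles
age_q <- quantile(age, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)

# Interquartile range: Q3 - Q1
age_iqr <- IQR(age, na.rm = TRUE)

age_range  # 42
age_q      # 31, 32, 37
age_iqr    # 6
```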
Mean - 2*SD = -0.803
→ Normally distributed variables are usually described with the mean and standard deviation.
→ Non-normal variables are usually described with the median and the first and third quartiles (Q1 and Q3) or the interquartile range.
| | N = 200 |
|---|---|
| Age, Mean (SD) | 47 (14) |
| Marker Level (ng/mL), Median (Q1, Q3) | 0.64 (0.22, 1.41) |
| | Drug A, N = 98 | Drug B, N = 102 |
|---|---|---|
| Age, Mean (SD) | 47 (15) | 47 (14) |
| Marker Level (ng/mL), Median (Q1, Q3) | 0.84 (0.23, 1.60) | 0.52 (0.18, 1.21) |
R package that provides an elegant and flexible way to create publication-ready analytical and summary tables.
It can summarize data sets, regression models, and more.
The tbl_summary() function calculates descriptive statistics for continuous and categorical columns of a dataframe or tibble.
Sjoberg DD, Whiting K, Curry M, Lavery JA, Larmarange J. Reproducible summary tables with the gtsummary package. The R Journal 2021;13:570–80. https://doi.org/10.32614/RJ-2021-053.
Four types of summaries: continuous, continuous2, categorical, and dichotomous
Variables coded 0/1, FALSE/TRUE, No/Yes treated as dichotomous
Statistics are median (p25, p75) for continuous, n (%) for categorical/dichotomous
NA values will be shown as “Unknown”
Label attributes are printed automatically
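The tables that follow look like the default summary being customized one argument at a time. The calls that produced them are not shown, so the sketch below is only an assumption about how such tables could be built with documented tbl_summary() arguments; the object names tbl_default and tbl_custom are made up for this illustration.

```r
library(gtsummary)

# Default summary by treatment group: median (Q1, Q3) and n (%)
tbl_default <- trial |>
  tbl_summary(by = trt, include = c(age, marker, grade, response))

# Same table with several defaults overridden
tbl_custom <- trial |>
  tbl_summary(
    by = trt,
    include = c(age, marker, grade, response),
    missing = "no",                     # hide the "Unknown" rows
    percent = "row",                    # row percentages instead of column percentages
    label = age ~ "Patient Age",        # override the label attribute
    statistic = age ~ "{mean} ({sd})",  # mean (SD) for age only
    type = response ~ "categorical"     # show both levels of a 0/1 variable
  )

tbl_custom
```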
| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ |
|---|---|---|
| Age | 46 (37, 60) | 48 (39, 56) |
| Unknown | 7 | 4 |
| Marker Level (ng/mL) | 0.84 (0.23, 1.60) | 0.52 (0.18, 1.21) |
| Unknown | 6 | 4 |
| Grade | | |
| I | 35 (36%) | 33 (32%) |
| II | 32 (33%) | 36 (35%) |
| III | 31 (32%) | 33 (32%) |
| Tumor Response | 28 (29%) | 33 (34%) |
| Unknown | 3 | 4 |
| ¹ Median (Q1, Q3); n (%) | | |
| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ |
|---|---|---|
| Age | 46 (37, 60) | 48 (39, 56) |
| Marker Level (ng/mL) | 0.84 (0.23, 1.60) | 0.52 (0.18, 1.21) |
| Grade | | |
| I | 35 (36%) | 33 (32%) |
| II | 32 (33%) | 36 (35%) |
| III | 31 (32%) | 33 (32%) |
| Tumor Response | 28 (29%) | 33 (34%) |
| ¹ Median (Q1, Q3); n (%) | | |
| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ |
|---|---|---|
| Age | 46 (37, 60) | 48 (39, 56) |
| Marker Level (ng/mL) | 0.84 (0.23, 1.60) | 0.52 (0.18, 1.21) |
| Grade | | |
| I | 35 (51%) | 33 (49%) |
| II | 32 (47%) | 36 (53%) |
| III | 31 (48%) | 33 (52%) |
| Tumor Response | 28 (46%) | 33 (54%) |
| ¹ Median (Q1, Q3); n (%) | | |
| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ |
|---|---|---|
| Patient Age | 46 (37, 60) | 48 (39, 56) |
| Marker Level (ng/mL) | 0.84 (0.23, 1.60) | 0.52 (0.18, 1.21) |
| Grade | | |
| I | 35 (51%) | 33 (49%) |
| II | 32 (47%) | 36 (53%) |
| III | 31 (48%) | 33 (52%) |
| Tumor Response | 28 (46%) | 33 (54%) |
| ¹ Median (Q1, Q3); n (%) | | |
| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ |
|---|---|---|
| Patient Age | 47 (15) | 47 (14) |
| Marker Level (ng/mL) | 0.84 (0.23, 1.60) | 0.52 (0.18, 1.21) |
| Grade | | |
| I | 35 (51%) | 33 (49%) |
| II | 32 (47%) | 36 (53%) |
| III | 31 (48%) | 33 (52%) |
| Tumor Response | 28 (46%) | 33 (54%) |
| ¹ Mean (SD); Median (Q1, Q3); n (%) | | |
| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ |
|---|---|---|
| Patient Age | 47 (15) | 47 (14) |
| Marker Level (ng/mL) | 0.84 (0.23, 1.60) | 0.52 (0.18, 1.21) |
| Grade | | |
| I | 35 (51%) | 33 (49%) |
| II | 32 (47%) | 36 (53%) |
| III | 31 (48%) | 33 (52%) |
| Tumor Response | | |
| 0 | 67 (51%) | 65 (49%) |
| 1 | 28 (46%) | 33 (54%) |
| ¹ Mean (SD); Median (Q1, Q3); n (%) | | |
Vignette of tbl_summary(): https://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html
{gtsummary} full presentation (Daniel D. Sjoberg): https://www.danieldsjoberg.com/clinical-reporting-gtsummary-rmed/material.html
Warning
Pie charts are not a good way of describing categorical data because they become difficult to read as the number of categories increases.
The objective is usually to visualize the shape of a variable distribution.
The most common graphical representations of the distribution of a numerical variable are the histogram, the density plot and the box plot.
ggplot2 is the most popular R package for producing visualizations of data.
Unlike many graphics packages, ggplot2 uses a conceptual framework based on the grammar of graphics.
It is part of the tidyverse, but it combines layers with + instead of a pipe operator (|> or %>%).
library(ggplot2)
#Define data and mapping:
ggplot(data = trial, aes(x = stage)) +
#Create bar plot layer:
geom_bar(fill = "grey", color = "black", width = .8) +
#Change x scale:
scale_x_discrete(name = "T Stage") +
#Change y scale:
scale_y_continuous(name = "Counts", limits = c(0, 60), breaks = seq(0, 60, by = 10))

library(ggplot2)
#Define data and mapping:
ggplot(data = trial, aes(x = stage)) +
#Create bar plot layer:
geom_bar(fill = "grey", color = "black", width = .8) +
#Change x scale:
scale_x_discrete(name = "T Stage") +
#Change y scale:
scale_y_continuous(name = "Counts", limits = c(0, 60), breaks = seq(0, 60, by = 10)) +
#Apply a black & white theme:
theme_bw()

library(ggplot2)
#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
#Create bar plot layer:
geom_bar(color = "black", width = .8) +
#Change x scale:
scale_x_discrete(name = "T Stage") +
#Change y scale:
scale_y_continuous(name = "Counts", limits = c(0, 60), breaks = seq(0, 60, by = 10)) +
#Apply a black & white theme:
theme_bw()

library(ggplot2)
#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
#Create bar plot layer:
geom_bar(color = "black", width = .8) +
#Change x scale:
scale_x_discrete(name = "T Stage") +
#Change y scale:
scale_y_continuous(name = "Counts", limits = c(0, 60), breaks = seq(0, 60, by = 10)) +
#Change fill legend:
scale_fill_discrete(name = "Treatment") +
#Apply a black & white theme:
theme_bw()

library(ggplot2)
#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
#Create bar plot layer:
geom_bar(color = "black", width = .8, position = position_dodge()) +
#Change x scale:
scale_x_discrete(name = "T Stage") +
#Change y scale:
scale_y_continuous(name = "Counts", limits = c(0, 60), breaks = seq(0, 60, by = 10)) +
#Change fill legend:
scale_fill_discrete(name = "Treatment") +
#Apply a black & white theme:
theme_bw()

library(ggplot2)
#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
#Create bar plot layer:
geom_bar(color = "black", width = .8, position = position_dodge()) +
#Change x scale:
scale_x_discrete(name = "T Stage") +
#Change y scale:
scale_y_continuous(name = "Counts", limits = c(0, 30), breaks = seq(0, 30, by = 5)) +
#Change fill legend:
scale_fill_discrete(name = "Treatment") +
#Apply a black & white theme:
theme_bw()

library(ggplot2)
#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
#Create bar plot layer:
geom_bar(color = "black", width = .8, position = "fill") +
#Change x scale:
scale_x_discrete(name = "T Stage") +
#Change y scale:
scale_y_continuous(name = "Percentage") +
#Change fill legend:
scale_fill_discrete(name = "Treatment") +
#Apply a black & white theme:
theme_bw()

library(ggplot2)
#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
#Create bar plot layer:
geom_bar(color = "black", width = .8, position = "fill") +
#Change x scale:
scale_x_discrete(name = "T Stage") +
#Change y scale:
scale_y_continuous(name = "Percentage", labels = scales::percent) +
#Change fill legend:
scale_fill_discrete(name = "Treatment") +
#Apply a black & white theme:
theme_bw()

#Define data and mapping:
ggplot(data = trial, aes(x = age)) +
#Create histogram plot layer:
geom_histogram(binwidth = 5, fill = "grey", color = "black") +
#Change x scale:
scale_x_continuous(name = "Age") +
#Change y scale:
scale_y_continuous(name = "Counts") +
#Apply a black & white theme:
theme_bw()

#Define data and mapping:
ggplot(data = trial, aes(x = age, fill = trt)) +
#Create histogram plot layer:
geom_histogram(binwidth = 5, color = "black") +
#Change x scale:
scale_x_continuous(name = "Age") +
#Change y scale:
scale_y_continuous(name = "Counts") +
#Apply a black & white theme:
theme_bw()

#Define data and mapping:
ggplot(data = trial, aes(x = age, fill = trt)) +
#Create histogram plot layer:
geom_histogram(binwidth = 5, color = "black") +
#Change x scale:
scale_x_continuous(name = "Age") +
#Change y scale:
scale_y_continuous(name = "Counts") +
#Change fill legend:
scale_fill_discrete(name = "Treatment") +
#Apply a black & white theme:
theme_bw()

library(ggplot2)
#Define data and mapping:
ggplot(data = trial, aes(x = age)) +
#Create density plot layer:
geom_density(alpha = .6, fill = "#68abb8") +
#Change x scale:
scale_x_continuous(name = "Age", limits = c(0, 90), breaks = c(0, 25, 50, 75)) +
#Change y scale:
scale_y_continuous(name = "Density")

#Define data and mapping:
ggplot(data = trial, aes(x = age)) +
#Create density plot layer:
geom_density(alpha = .6, fill = "#68abb8") +
#Change x scale:
scale_x_continuous(name = "Age", limits = c(0, 90), breaks = c(0, 25, 50, 75)) +
#Change y scale:
scale_y_continuous(name = "Density") +
#Apply a black & white theme:
theme_bw()

#Define data and mapping:
ggplot(data = trial, aes(x = age, fill = trt)) +
#Create density plot layer:
geom_density(alpha = .6) +
#Change x scale:
scale_x_continuous(name = "Age", limits = c(0, 90), breaks = c(0, 25, 50, 75)) +
#Change y scale:
scale_y_continuous(name = "Density") +
#Apply a black & white theme:
theme_bw()

#Define data and mapping:
ggplot(data = trial, aes(x = age, fill = trt)) +
#Create density plot layer:
geom_density(alpha = .6) +
#Change x scale:
scale_x_continuous(name = "Age", limits = c(0, 90), breaks = c(0, 25, 50, 75)) +
#Change y scale:
scale_y_continuous(name = "Density") +
#Change fill legend:
scale_fill_discrete(name = "Treatment") +
#Apply a black & white theme:
theme_bw()

#Define data and mapping:
ggplot(data = trial, aes(y = age)) +
#Create box plot layer:
geom_boxplot(alpha = .6, fill = "#68abb8") +
#Change x scale:
scale_x_continuous(limits = c(-1, 1)) +
#Change y scale:
scale_y_continuous(name = "Age", limits = c(0, 90)) +
#Apply a black & white theme:
theme_bw()

#Define data and mapping:
ggplot(data = trial, aes(y = age)) +
#Create box plot layer:
geom_boxplot(alpha = .6, fill = "#68abb8") +
#Change x scale:
scale_x_continuous(limits = c(-1, 1)) +
#Change y scale:
scale_y_continuous(name = "Age", limits = c(0, 90)) +
#Apply a black & white theme:
theme_bw() +
#Apply another theme to remove the x axis ticks and labels
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

#Define data and mapping:
ggplot(data = trial, aes(x = trt, y = age, fill = trt)) +
#Create box plot layer:
geom_boxplot(alpha = .6) +
#Change y scale:
scale_y_continuous(name = "Age", limits = c(0, 90)) +
#Change fill legend:
scale_fill_discrete(name = "Treatment") +
#Apply a black & white theme:
theme_bw()

#Define data and mapping:
ggplot(data = trial, aes(x = trt, y = age, fill = trt)) +
#Create box plot layer:
geom_boxplot(alpha = .6) +
#Change x scale:
scale_x_discrete(name = "Treatment") +
#Change y scale:
scale_y_continuous(name = "Age", limits = c(0, 90)) +
#Apply a black & white theme:
theme_bw() +
#Remove legend:
theme(legend.position = "none")

Vignette of the package: https://ggplot2.tidyverse.org/articles/ggplot2.html
Useful recipes: http://www.cookbook-r.com/Graphs/
For example:
A researcher may compare the performance of students in two different schools, or compare the performance of students in two different grade levels.
A researcher may compare the same group of people before and after taking a medication or compare the productivity of employees before and after a training program.
Independent samples:
Independent samples are selected randomly, so that the observations in one sample do not depend on the values of the other observations.
For example, a group of men and a group of women are each asked about their health status.
Paired samples:
In dependent (paired) samples, the measurements are related.
For example, a sample of patients who have taken a painkiller are asked about their pain before and after taking the medicine.
When to use
The samples come from normally distributed populations.
If the populations have unequal variances, the Welch modification is used.
Hypothesis Testing:
Null Hypothesis (H₀): The means are equal across groups.
Alternative Hypothesis (H₁): The means are different across groups.
Formula: equal variance vs unequal variance
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]
\[ t_\text{Welch} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}} \]
\(\bar{x}_1, \bar{x}_2 = \text{Sample means}\)
\(s_1, s_2 = \text{Sample standard deviations}\)
\(s_p = \text{Pooled standard deviation}\)
\(n_1, n_2 = \text{Sample sizes}\)
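To make the two formulas concrete, the sketch below computes both statistics by hand and cross-checks them against t.test(). The two vectors are hypothetical values invented for this illustration, and the pooled standard deviation uses the usual weighted-variance definition, which is assumed here because it is not spelled out above.

```r
# Hypothetical measurements for two independent groups
x1 <- c(5.1, 4.9, 6.2, 5.8, 6.0)
x2 <- c(4.2, 5.0, 4.8, 5.5, 4.4, 4.9)

n1 <- length(x1)
n2 <- length(x2)
s1 <- sd(x1)
s2 <- sd(x2)

# Pooled standard deviation (equal-variance case)
s_p <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Student's t (equal variances) and Welch's t (unequal variances)
t_student <- (mean(x1) - mean(x2)) / (s_p * sqrt(1 / n1 + 1 / n2))
t_welch   <- (mean(x1) - mean(x2)) / sqrt(s1^2 / n1 + s2^2 / n2)

# Cross-check against t.test(); Welch is its default (var.equal = FALSE)
fit_student <- t.test(x1, x2, var.equal = TRUE)
fit_welch   <- t.test(x1, x2)
all.equal(t_student, unname(fit_student$statistic))
all.equal(t_welch, unname(fit_welch$statistic))
```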
Welch Two Sample t-test
data: age by trt
t = -0.2093, df = 184.19, p-value = 0.8344
alternative hypothesis: true difference in means between group Drug A and group Drug B is not equal to 0
95 percent confidence interval:
-4.566621 3.690640
sample estimates:
mean in group Drug A mean in group Drug B
47.01099 47.44898
gtsummary: Use test = age ~ "t.test" in add_p() to apply an independent-samples t-test comparing mean age between treatment groups.
Hypothesis Testing:
Null Hypothesis (H₀): The distributions of both groups are equal.
Alternative Hypothesis (H₁): The distributions of both groups are different.
Formula:
\[W = R_1 - \frac{n_1(n_1 +1)}{2}\]
\(R_1 = \text{Sum of ranks for the reference group}\)
\(n_1 = \text{Number of observations in the reference group}\)
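The statistic can be reproduced by ranking all observations together, as in the sketch below. The two vectors are hypothetical values chosen without ties so that no tie correction is involved.

```r
# Hypothetical measurements without ties
x1 <- c(1.1, 2.3, 3.5, 4.8)        # reference group
x2 <- c(0.9, 2.0, 5.2, 6.1, 7.3)

n1 <- length(x1)

# Rank all observations together, then sum the ranks of the reference group
ranks <- rank(c(x1, x2))
R1 <- sum(ranks[seq_len(n1)])

# W as defined in the formula above
W <- R1 - n1 * (n1 + 1) / 2

# wilcox.test() reports the same statistic for its first sample
fit <- wilcox.test(x1, x2)

R1             # 17
W              # 7
fit$statistic  # W = 7
```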
Wilcoxon rank sum test with continuity correction
data: age by trt
W = 4323, p-value = 0.7183
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-4.999980 3.999954
sample estimates:
difference in location
-0.9999612
gtsummary:
| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ | p-value² |
|---|---|---|---|
| Age | 46 (37, 60) | 48 (39, 56) | 0.7 |
| Unknown | 7 | 4 | |
| Marker Level (ng/mL) | 0.84 (0.23, 1.60) | 0.52 (0.18, 1.21) | 0.085 |
| Unknown | 6 | 4 | |
| ¹ Median (Q1, Q3) | | | |
| ² Wilcoxon rank sum test | | | |
By default, add_p() uses the Wilcoxon rank-sum test for continuous variables.
When to use
Compares observed frequencies to expected frequencies.
Appropriate when sample sizes are large (expected cell counts are ≥ 5).
\[E = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}} = \frac{28 \times 33}{48} = 19.25\]
Hypothesis Testing:
Null Hypothesis (H₀): There is no association between the categorical variables.
Alternative Hypothesis (H₁): An association exists between the categorical variables.
Formula:
\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]
\(O_i = \text{Observed frequency}\)
\(E_i = \text{Expected frequency}\)
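The expected counts and the statistic can be sketched by hand as follows. Only the row total (28), column total (33) and grand total (48) come from the worked expected count above; the individual cell counts are hypothetical values chosen to match those totals.

```r
# Hypothetical 2 x 2 table of observed counts
# (first row total 28, first column total 33, grand total 48)
observed <- matrix(c(20, 8,
                     13, 7),
                   nrow = 2, byrow = TRUE)

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected[1, 1]  # 19.25

# Chi-square statistic: sum over cells of (O - E)^2 / E
chi2 <- sum((observed - expected)^2 / expected)

# Cross-check against chisq.test() without continuity correction
fit <- chisq.test(observed, correct = FALSE)
all.equal(chi2, unname(fit$statistic))
```

For 2 x 2 tables chisq.test() applies a continuity correction by default, which is why correct = FALSE is needed to match the uncorrected formula.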
Example: treatment (trt) and cancer stage (stage).
gtsummary: The default test for most categorical data is the chi-square test.
If expected counts are low (<5), Fisher’s exact test is used instead.
When to use
Hypothesis Testing:
Null Hypothesis (H₀): Mean difference between the paired samples is zero.
Alternative Hypothesis (H₁): Mean difference between the paired samples is not equal to zero.
Formula:
\[t = \frac{\bar{d}}{\frac{s_d}{\sqrt{n}}}\]
\(\bar{d} = \text{Mean of differences}\)
\(s_d = \text{Standard deviation of differences}\)
\(n = \text{Number of pairs}\)
| extra | group | ID |
|---|---|---|
| 0.7 | 1 | 1 |
| 1.9 | 2 | 1 |
| -1.6 | 1 | 2 |
| 0.8 | 2 | 2 |
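Using the built-in sleep data shown above, the paired t statistic can be sketched directly from the differences and then cross-checked against t.test(). Treating group 1 as the pre-intervention measurement and group 2 as the post-intervention measurement is an assumption made for this illustration.

```r
# Built-in sleep data: both groups are measured on the same 10 subjects
pre_interv  <- sleep$extra[sleep$group == 1]
post_interv <- sleep$extra[sleep$group == 2]

# Paired differences
d <- pre_interv - post_interv
n <- length(d)

# t statistic: mean difference divided by its standard error
t_manual <- mean(d) / (sd(d) / sqrt(n))

# Cross-check against the paired t.test()
fit <- t.test(pre_interv, post_interv, paired = TRUE)
all.equal(t_manual, unname(fit$statistic))
```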
Hypothesis Testing:
Null Hypothesis (H₀): The differences between paired observations are symmetrically distributed around zero.
Alternative Hypothesis (H₁): The differences are not symmetrically distributed around zero.
Formula:
\[ W = \sum_{i=1}^{N_r} \left[ \operatorname{sgn}(x_{2,i} - x_{1,i}) \cdot R_i \right] \]
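\(x_{1,i}, x_{2,i} = \text{Paired observations from the two samples}\)
\(R_i = \text{Rank of } |x_{2,i} - x_{1,i}| \text{ among the non-zero differences}\)
\(N_r = \text{Number of pairs with a non-zero difference}\)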
Wilcoxon signed rank test with continuity correction
data: pre_interv and post_interv
V = 0, p-value = 0.009091
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-2.949921 -1.050018
sample estimates:
(pseudo)median
-1.400031
When to use
Compare proportions or frequencies between two related groups.
Determine whether the two kinds of disagreement between the conditions are equally frequent.
Hypothesis Testing:
Null Hypothesis (H₀): The proportions of discordant pairs are equal.
Alternative Hypothesis (H₁): The proportions of discordant pairs differ.
Formula:
\[ \chi^2 = \frac{(b - c)^2}{b + c} \]
\(b, c = \text{Counts of the two types of discordant pairs}\)
# dplyr provides mutate(), case_when(), filter() and pull()
library(dplyr)
# Creating a categorical version of the variable `extra`
sleep <- sleep |>
  mutate(extra_cat = case_when(extra >= 0 ~ "Positive",
                               extra < 0 ~ "Negative"),
         extra_cat = factor(extra_cat))
# Creating the objects pre & post intervention
pre_interv_cat <- sleep |> filter(group == 1) |> pull(extra_cat)
post_interv_cat <- sleep |> filter(group == 2) |> pull(extra_cat)
# McNemar Test
mcnemar.test(pre_interv_cat, post_interv_cat)

gtsummary:
Link: https://www.danieldsjoberg.com/gtsummary/reference/tests.html
Recapping:
A statistically significant result does not indicate the size of the effect or its clinical relevance.
The clinical significance of a finding is determined by assessing whether the effect is large enough to influence medical practice or decision making.
With large samples, even very small, clinically irrelevant differences can be statistically significant.
With small samples, even large differences can be statistically non-significant.
Important
Report the effect size
gtsummary:
trial |>
  select(age, trt) |>
  tbl_summary(by = trt,
              statistic = age ~ "{mean} ({sd})",
              digits = age ~ c(2, 2)) |>
  add_difference()

| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ | Difference² | 95% CI² | p-value² |
|---|---|---|---|---|---|
| Age | 47.01 (14.71) | 47.45 (14.01) | -0.44 | -4.6, 3.7 | 0.8 |
| Unknown | 7 | 4 | | | |
| Abbreviation: CI = Confidence Interval | | | | | |
| ¹ Mean (SD) | | | | | |
| ² Welch Two Sample t-test | | | | | |
t_test().
\[d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}\]
\(\bar{x}_1, \bar{x}_2 = \text{Sample means}\)
\(s_p = \text{Pooled standard deviation}\)
The d statistic redefines the difference in means as the number of standard deviations that separates those means.
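As a minimal sketch, Cohen's d can be computed by hand in base R. The two vectors are hypothetical values chosen so that the arithmetic is easy to follow: both groups have a variance of 4, so the pooled SD is 2 and a difference in means of 1 corresponds to half a standard deviation.

```r
# Hypothetical measurements for two independent groups
x1 <- c(2, 4, 6)
x2 <- c(1, 3, 5)

n1 <- length(x1)
n2 <- length(x2)

# Pooled standard deviation
s_p <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))

# Cohen's d: difference in means expressed in pooled SD units
d <- (mean(x1) - mean(x2)) / s_p
d  # 0.5
```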
R:
Cohen's d | 95% CI
-------------------------
-0.03 | [-0.32, 0.25]
- Estimated using pooled SD.
| d value | Rough interpretation |
|---|---|
| 0.2 ≤ d < 0.5 | Small effect |
| 0.5 ≤ d < 0.8 | Moderate effect |
| d ≥ 0.8 | Large effect |
A difference smaller than 0.2 standard deviations is considered trivial, even if statistically significant.
gtsummary:
trial |>
  select(age, trt) |>
  tbl_summary(by = trt,
              statistic = age ~ "{mean} ({sd})",
              digits = age ~ c(2, 2)) |>
  add_difference(test = age ~ "cohens_d")

| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ | Difference² | 95% CI² |
|---|---|---|---|---|
| Age | 47.01 (14.71) | 47.45 (14.01) | -0.03 | -0.32, 0.25 |
| Unknown | 7 | 4 | | |
| Abbreviation: CI = Confidence Interval | | | | |
| ¹ Mean (SD) | | | | |
| ² Cohen’s D | | | | |
gtsummary:
trial |>
  select(age, trt) |>
  tbl_summary(by = trt,
              digits = all_continuous() ~ c(2, 2)) |>
  add_difference(test = age ~ "wilcox.test")

| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ | Difference² | 95% CI² | p-value² |
|---|---|---|---|---|---|
| Age | 46.00 (37.00, 60.00) | 48.00 (39.00, 56.00) | -1.0 | -5.0, 4.0 | 0.7 |
| Unknown | 7 | 4 | | | |
| Abbreviation: CI = Confidence Interval | | | | | |
| ¹ Median (Q1, Q3) | | | | | |
| ² Wilcoxon rank sum test | | | | | |
gtsummary:
trial |>
  select(response, trt) |>
  tbl_summary(by = trt) |>
  add_difference(test = response ~ "prop.test")

| Characteristic | Drug A, N = 98¹ | Drug B, N = 102¹ | Difference² | 95% CI² | p-value² |
|---|---|---|---|---|---|
| Tumor Response | 28 (29%) | 33 (34%) | -4.2% | -18%, 9.9% | 0.6 |
| Unknown | 3 | 4 | | | |
| Abbreviation: CI = Confidence Interval | | | | | |
| ¹ n (%) | | | | | |
| ² 2-sample test for equality of proportions with continuity correction | | | | | |
gtsummary does not support calculating relative risks using the add_difference() function.
risk ratio with 95% C.I.
estimate lower upper
Drug A 1.000000 NA NA
Drug B 1.142493 0.7528554 1.733785
The incidence of tumour response is 14% higher in the group treated with drug B than in the group treated with drug A.
gtsummary does not support calculating odds ratio using the add_difference() function.
odds ratio with 95% C.I.
estimate lower upper
Drug A 1.000000 NA NA
Drug B 1.212851 0.6588486 2.244154
The odds of tumour response are 21% higher in the group treated with drug B than in the group treated with drug A.
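Both measures can be sketched by hand from the tumour response counts reported in the summary tables earlier (28 responders and 67 non-responders with Drug A, 33 and 65 with Drug B). The calls that produced the outputs above are not shown, so this is a plain base R calculation rather than a reproduction of them.

```r
# Tumour response counts by treatment, taken from the summary tables
responders     <- c("Drug A" = 28, "Drug B" = 33)
non_responders <- c("Drug A" = 67, "Drug B" = 65)

# Risk (proportion responding) and odds of response in each group
risk <- responders / (responders + non_responders)
odds <- responders / non_responders

# Drug B relative to Drug A
risk_ratio <- unname(risk["Drug B"] / risk["Drug A"])
odds_ratio <- unname(odds["Drug B"] / odds["Drug A"])

round(risk_ratio, 4)  # 1.1425
round(odds_ratio, 4)  # 1.2148
```

The risk ratio matches the estimate reported above. The cross-product odds ratio (1.2148) differs slightly from the reported 1.2129, presumably because the function used there applies a different estimator; that explanation is an assumption.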
gtsummary:
Link: https://www.danieldsjoberg.com/gtsummary/reference/tests.html
Handle with care
…life is not so easy!
Applied Biostatistics Course with R